4 Model Building and Selection
In this module, we explore strategies for building effective regression models. We cover:
- Increasing model complexity to capture non-linear relationships and interactions,
- Assessing model fit while balancing complexity,
- Techniques for selecting parsimonious models.
4.1 Increasing model complexity to model complex relationships
4.1.1 Why increase model complexity?
We began this course with the simple linear regression model: a single continuous predictor with a straight-line effect. We then expanded to multiple predictors, where each predictor still had a straight-line effect. For reasons that will become clear, these are examples of so-called first-order models:
Key-term 1: First-order model
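In general, a first-order model with \(k\) predictors takes the form \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k, \] with each predictor entering only once, 'as is' - no powers, no products.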
This first-order structure is the simplest linear model we can make with \(X\). It is often a reasonable starting point, and can go a long way to modeling real-world phenomena (especially in the multiple regression case), but real data are often more complex. Real relationships between predictors and outcomes are often non-linear, or involve interactions between predictors. To capture these more complex relationships, we can increase model complexity by adding ‘higher-order’ and ‘interaction’ terms.
Example 1: Non-linear relationships in data
For example, fitting a straight line to the following data is not ideal, as the relationship between \(X\) and \(Y\) is curved (U-shaped), so there is always part of the relationship the line cannot capture:
4.1.2 Square, Cubic, and Higher-Order Univariate Models
We enrich the basic linear model by adding powers of a predictor, allowing the fitted relationship to bend. In the above case, the “U” shape suggests a quadratic (squared) term may be appropriate.
Key-term 2: Second-order (Quadratic) univariate model
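The quadratic model adds a squared term to the first-order model: \[ E[Y] = \beta_0 + \beta_1X + \beta_2X^2, \] where \(\beta_2\) controls the direction and strength of the curvature.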
Example 2: Seeing how \(\beta_2\) bends the curve
Adjust the slider for the squared term to see how changing \(\beta_2\) adds curvature to the fitted relationship. Try to find a good fit - can you guess what model generated this data?
4.1.2.1 Third-order (cubic) univariate model
\[ E[Y] = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 \]
allows more complex curvature with two changes in direction (an “S”-shape) that a quadratic term cannot capture.
Example 3: Seeing how \(\beta_3\) twists the curve
Adjust the slider for the cubic term to see how changing \(\beta_3\) adds an inflection point to the fitted relationship. Can you tune the parameters to recover the curve that generated this data?
4.1.2.2 N-th order univariate model
We can extend this process of deriving new terms by adding further powers of \(X\), allowing us to fit arbitrarily complex curves: \[ E[Y] = \beta_0 + \beta_1X + \dots + \beta_nX^n. \]
Note that each term gets its own parameter (\(\beta_i\)). If we have \(n\) data points, an \((n-1)\)-th-order model will have a parameter for each point, meaning it will fit the data perfectly!
Example 4: Fitting higher-order polynomial models
Higher-order terms increase the model’s flexibility, but also increase the risk of fitting random noise rather than meaningful structure. We will return to this important issue later in the module.
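To make this concrete, here is a minimal sketch comparing polynomial fits of increasing degree. It uses simulated data (the example's interactive dataset is not reproduced here), so the names `x`, `y`, and `dat` are illustrative:

```r
# Simulated data: the true relationship is quadratic plus noise
set.seed(42)
n <- 10
x <- seq(-2, 2, length.out = n)
y <- x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Fit polynomials of increasing degree and compare in-sample fit
for (deg in c(1, 2, n - 1)) {
  fit <- lm(y ~ poly(x, deg), data = dat)
  cat("degree", deg, "R-squared:", round(summary(fit)$r.squared, 4), "\n")
}
# The degree n-1 = 9 model has one parameter per data point, so its
# R-squared is (essentially) 1 - it interpolates the noise exactly
```

The degree-2 model captures the real structure; the degree-9 model merely memorises this particular sample.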
Note: The meaning of “linear” in linear models
4.1.2.3 Fitting higher-order univariate models in R
In R, higher-order polynomial terms can be included using the I() function to indicate ‘as is’ operations. For example, to fit a quadratic model:
lm(y ~ x + I(x^2), data = nonlinearData)
Call:
lm(formula = y ~ x + I(x^2), data = nonlinearData)
Coefficients:
(Intercept) x I(x^2)
2.6185 -3.3760 0.8218
or using the poly() function:
lm(y ~ poly(x, 2), data = nonlinearData)
Call:
lm(formula = y ~ poly(x, 2), data = nonlinearData)
Coefficients:
(Intercept) poly(x, 2)1 poly(x, 2)2
0.3493 0.3941 3.0208
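Note that the two sets of coefficients above differ even though both models are quadratic: poly() defaults to orthogonal polynomials, which reparameterise the same fit. A small sketch of this equivalence, using simulated stand-in data (nonlinearData is not reproduced here):

```r
# Simulated stand-in for nonlinearData
set.seed(1)
nonlinearData <- data.frame(x = runif(100, -3, 3))
nonlinearData$y <- 2 - 3 * nonlinearData$x + 0.8 * nonlinearData$x^2 + rnorm(100)

m1 <- lm(y ~ x + I(x^2), data = nonlinearData)             # raw powers via I()
m2 <- lm(y ~ poly(x, 2, raw = TRUE), data = nonlinearData) # raw powers via poly()
m3 <- lm(y ~ poly(x, 2), data = nonlinearData)             # orthogonal (default)

all.equal(unname(coef(m1)), unname(coef(m2)))  # same coefficients
all.equal(fitted(m1), fitted(m3))              # same fitted values either way
```

Orthogonal polynomials reduce collinearity between the power terms; the fitted curve is identical under both parameterisations.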
4.1.3 Interaction Models with Continuous Predictors
In ?sec-slr we saw a multiple regression model with two continuous predictors:
\[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2. \]
This again is a first-order model - each predictor appears only once, 'as is'. Of course, we can add higher-order terms for each predictor separately (e.g., \(X_1^2\), \(X_2^3\)) to capture curvature in their individual effects, as we did above. For example, a second-order multivariate model with two predictors might look like: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1^2 + \beta_4X_2^2. \]
Now, however, we can also consider a different kind of higher-order term: what happens when we multiply predictors together? This gives us an interaction term - a term that combines two (or more) predictors.
Key-term 3: Interaction term
Example 5: Fitting an interaction model to the advertising dataset.
Returning to the advertising dataset from ?sec-mlr, our first-order linear model did not fit the data well in particular regions:
ads <- read.csv("Data/Advertising.csv")
ads_mod <- lm(Sales ~ TV + Radio, data = ads)

The plane does not fit the data at the edges - it overestimates sales when either Radio or TV advertising is low (separately), but underestimates sales when both are high. This suggests that the effect of increasing one type of advertising depends on the level of the other type - an interaction effect.
We can fit a second-order interaction model to capture this: \[ E[Sales] = \beta_0 + \beta_1TV + \beta_2Radio + \beta_3(TV \times Radio). \]
The fitted interaction model has a characteristic 'saddle' shape, which can better capture the data structure:
The interaction effect can also be visualised by fixing one predictor and plotting the relationship between the other predictor and the outcome. In this case, we can plot lines representing the relationship between TV advertising and Sales when Radio advertising is low ($0 spent), average (mean(ads$Radio) = $23.264 thousand), and high (max(ads$Radio) = $49.6 thousand):
We can see that the relationship between TV advertising and Sales (represented by the slope of the regression lines) is different at different levels of Radio advertising. In particular, as radio advertising increases, the slope of the line increases, indicating that TV advertising has a larger effect on Sales when Radio advertising is also high.
When our first order model has more than two predictors, we can include interaction terms between any pair (or more) of predictors. For example, with three predictors \(X_1\), \(X_2\), and \(X_3\), a second-order interaction model would include interactions between each pair: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3. \]
4.1.3.1 Higher-order interaction models
We can also consider higher-order interactions - those that include combinations of three or more predictors. For example, a third-order interaction model with three predictors would include the ‘three-way’ interaction term: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3 + \beta_7X_1X_2X_3. \] The \(\beta_7\) parameter here captures how the interaction between two predictors changes depending on the level of the third predictor.
As in the case of higher-order univariate models, we can extend this idea to include both higher-order interaction terms and higher-order univariate terms for each predictor, allowing for very flexible modeling of complex relationships.
However, as before, increasing model complexity in this way raises the risk of overfitting and interpretability challenges, which we will discuss later in this module.
4.1.3.2 Fitting interaction models in R
In R, interaction terms can be included using the * operator in the model formula, which expands to all main effects plus their interactions. For example, to fit a model with three predictors including all two-way interactions and the three-way interaction:

lm(Y ~ X1*X2*X3, data = mydata)

To include specific interaction terms without this full expansion, use the : operator. For example, if we only wanted the interaction between X1 and X2, along with the linear terms for X1, X2, and X3:
lm(Y ~ X1+X2+X3+X1:X2, data = mydata)

4.1.4 Interaction Models with Categorical Predictors
A categorical variable with \(k\) levels is represented by \(k-1\) dummy variables: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \dots + \gamma_{k-1}D_{k-1}. \]
- Each \(\gamma_j\) measures the difference between level \(j\) and the baseline.
- Changing the baseline changes interpretations but not fitted values.
Categorical predictors may also interact with continuous predictors: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \delta_1(XD_1), \] allowing each group to have its own slope.
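As a sketch of how this is fitted in R, using simulated data (the names group, x, and y are illustrative):

```r
# Simulated data: group B has a steeper slope than baseline group A
set.seed(2)
d <- data.frame(
  x     = runif(120),
  group = factor(rep(c("A", "B", "C"), each = 40))
)
d$y <- 1 + 2 * d$x + ifelse(d$group == "B", 1.5, 0) * d$x + rnorm(120, sd = 0.3)

# y ~ x * group expands to x + group + x:group:
# group dummies shift the intercept, x:group terms shift the slope
fit <- lm(y ~ x * group, data = d)
coef(fit)
```

The coefficient on x is the slope for the baseline level A; x:groupB and x:groupC are each group's slope difference from that baseline.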
Note
4.2 Assessing model fit and model complexity
4.2.1 Why not increase model complexity?
In the previous section we saw how adding higher-order terms and interactions can help us capture complex relationships in data. However, increasing model complexity is not always beneficial. While more complex models can fit the training data (i.e. the data we use to fit our model) better, they may not generalise well to new data. This is because complex models can start to fit random noise in the training data, rather than the underlying signal. This phenomenon is known as overfitting.
In addition, more complex models can be harder to interpret, making it difficult to understand the relationships between predictors and the outcome. They may also be more sensitive to outliers and multicollinearity, leading to unstable parameter estimates.
A good model is no more complex than necessary to describe the main structure of the data. Therefore, when building regression models, we need to balance the trade-off between model fit and model complexity. In this section, we discuss some common issues that arise with complex models, and metrics for assessing model fit while accounting for complexity.
Key-point: Balancing fit and complexity
4.2.2 Fit Metrics
Fit metrics quantify how well a model captures patterns in the data. We have already seen one such metric: the coefficient of determination, \(R^2\).
Key-term 4: Coefficient of Determination (\(R^2\))
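As defined in earlier modules, \[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \] the proportion of variance in \(Y\) accounted for by the model.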
\(R^2\) was introduced in ?sec-slr as a measure of fit for simple linear regression models, based solely on the proportion of variance in \(Y\) explained by \(X\). We can also calculate \(R^2\) for multiple regression models, where it measures the proportion of variance in \(Y\) explained by all predictors jointly. However, \(R^2\) has a key limitation: it always increases (or at least does not decrease) when additional predictors are added to the model, even if those predictors do not meaningfully improve the model's explanatory power. This can lead to overfitting, where a model appears to fit the training data well but performs poorly on new data - for example, the \((n-1)\)-th-order polynomial model discussed in ?sec-ho_univariate has an \(SS_{res}\) of zero and thus an \(R^2\) of 1, but is unlikely to generalise well.
4.2.2.1 Adjusted \(R^2\)
To address this limitation, we can adjust \(R^2\) to penalise the addition of unnecessary predictors. The adjusted \(R^2\) introduces a penalty term based on the number of predictors and the sample size, providing a more balanced measure of model fit that accounts for complexity.
Key-term 5: Adjusted \(R^2\)
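With \(n\) observations and \(p\) predictors, the standard definition is: \[ R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}. \] The fraction grows with \(p\), so adding a predictor only raises \(R^2_{adj}\) if the improvement in \(R^2\) outweighs the complexity penalty.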
In R, adjusted \(R^2\) can be extracted with summary(lm_model)$adj.r.squared.
Adjusted \(R^2\) is particularly useful because of its standardised scale (it cannot exceed 1, and typically lies between 0 and 1), and the interpretation of \(R^2\) as the proportion of variance explained still holds - now with the added benefit of penalising unnecessary complexity. However, it is important to note that adjusted \(R^2\) is just one of many metrics available for assessing model fit while accounting for complexity.
4.2.2.2 Information Criteria: AIC, BIC
Another common approach to balancing model fit and complexity is to use information criteria, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These criteria provide a quantitative way to compare models, penalising those with more parameters.
Key-term 6: Akaike Information Criterion (AIC)
In R, AIC can be calculated with the AIC() function.
Key-term 7: Bayesian Information Criterion (BIC)
In R, BIC can be calculated with the BIC() function.
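For a model with \(k\) estimated parameters and maximised likelihood \(\hat{L}\), fitted to \(n\) observations, the standard definitions are: \[ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}. \] In both, the first term penalises complexity and the second rewards fit.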
Both AIC and BIC penalise model complexity, but BIC generally imposes a larger penalty for additional parameters, especially in larger samples. The choice between AIC and BIC often depends on the context and goals of the analysis. AIC is generally preferred for predictive accuracy, while BIC is more conservative and favours simpler models.
Unlike adjusted \(R^2\), AIC and BIC do not have a fixed scale, so their absolute values are not interpretable on their own. Instead, they are used to compare models fitted to the same dataset - the model with the lowest AIC or BIC is considered the best among the candidates.
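A minimal sketch showing how AIC and BIC are computed from a model's log-likelihood and parameter count, matching R's built-in AIC() and BIC() (simulated data):

```r
# Simulated data and a simple linear model
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

ll <- as.numeric(logLik(fit))  # maximised log-likelihood
k  <- attr(logLik(fit), "df")  # parameter count: intercept, slope, sigma
n  <- nobs(fit)

c(manual = 2 * k - 2 * ll, builtin = AIC(fit))       # identical
c(manual = k * log(n) - 2 * ll, builtin = BIC(fit))  # identical
```

Note that for lm objects the parameter count includes the residual variance, which is why logLik() reports df = 3 here rather than 2.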
4.2.2.3 Calculating fit metrics with broom::glance()
In R, we can calculate adjusted \(R^2\), AIC, and BIC using the broom package’s glance() function, which provides a tidy summary of model fit statistics. For example, using our fitted 2nd-order interaction model of the advertising dataset:
library(broom)
glance(ads_mod_int)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.968 0.967 0.944 1963. 6.68e-146 3 -270. 550. 567.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Note that glance() returns a data frame with the mentioned fit metrics, along with other useful statistics such as the number of observations (nobs), residual standard error (sigma) and the global F-statistic (statistic). Alongside tidy(), glance() is a powerful tool for summarising and comparing regression models in R.
4.2.2.4 Comparing models using fit metrics
When comparing multiple models fitted to the same dataset, we can use adjusted \(R^2\), AIC, and BIC to assess which model provides the best balance between fit and complexity. The key points to remember are:
- Higher adjusted \(R^2\) values indicate better fit.
- Lower AIC and BIC values indicate a better balance of fit and complexity.
Example 6: Comparing first-order and interaction models
Let's continue our investigation of the advertising dataset from ?sec-mlr by comparing our first-order model of Sales (ads_mod) to our second-order interaction model (ads_mod_int).
Let's combine the fit metrics from both models into a single table for easy comparison:

ads_fit_glance <- rbind(
  glance(ads_mod) |> `rownames<-`("First-order model"),
  glance(ads_mod_int) |> `rownames<-`("Interaction model"))
ads_fit_glance
# A tibble: 2 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
*     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.897         0.896 1.68       860. 4.83e- 98     2  -386.  780.  794.
2     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Since the interaction model ?eq-ads_int differs from the first-order model ?eq-ads_mod by the addition of only the \(TV \times Radio\) interaction term, we can interpret the change in fit metrics as the change in model fit due to adding this term. Let's go through each fit metric in turn:
- \(R^2\): Since the RSE (sigma) decreases, \(R^2\) increases from approximately 0.897 to 0.968, indicating that the interaction model explains more variance in Sales than the first-order model (as expected).
- Adjusted \(R^2\) increases by a similar amount, suggesting that the improvement in fit is substantial enough to outweigh the penalty for adding an additional parameter.
- AIC decreases from approximately 780 to 550, indicating that the interaction model provides a better balance between fit and complexity.
- BIC also decreases from approximately 794 to 567, further supporting the conclusion that the interaction model is preferable.
This seems pretty clear-cut: the interaction model provides a much better fit to the data (as we saw in our plots in ?sec-ads_int), and the fit metrics all indicate that the improvement in fit justifies the added complexity of the interaction term. But can we improve our model further by adding more terms?
Let's fit the 'full second-order model', including all main effects and the interaction term, as well as quadratic terms for both TV and Radio advertising:
ads_mod_full <- lm(Sales ~ TV + Radio + I(TV^2) + I(Radio^2) + TV:Radio, data = ads)
ads_fit_glance |> rbind(
glance(ads_mod_full) |> `rownames<-`("Full second-order model")
)
# A tibble: 3 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.897 0.896 1.68 860. 4.83e- 98 2 -386. 780. 794.
2 0.968 0.967 0.944 1963. 6.68e-146 3 -270. 550. 567.
3 0.986 0.986 0.624 2740. 8.17e-178 5 -187. 387. 410.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
anova(ads_mod_int, ads_mod_full) |> tidy()
# A tibble: 2 × 7
  term                         df.residual   rss    df sumsq statistic  p.value
  <chr>                              <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 Sales ~ TV + Radio + TV:Rad…         196 174.     NA    NA       NA  NA
2 Sales ~ TV + Radio + I(TV^2…         194  75.6     2  98.9      127. 6.06e-36

The full second-order model achieves the highest adjusted \(R^2\) and the lowest AIC and BIC, and the partial F-test above (p < 0.001) confirms that the quadratic terms significantly improve the fit beyond the interaction model.
4.3 A model building workflow
4.3.1 Exploratory data analysis and theory-driven model development
4.3.2 Automated model selection methods
4.3.3 Stepwise Regression
Stepwise methods provide automated ways to search for simpler models.
4.3.3.1 Forward Selection
Begin with a minimal model (often the intercept). Add predictors one at a time when they improve AIC or adjusted \(R^2\).
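A sketch of forward selection with base R's step(), which adds terms while AIC improves (simulated data; the variable names are illustrative):

```r
# Simulated data: y depends on x1 and x2, but not x3
set.seed(4)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(100)

null_mod <- lm(y ~ 1, data = d)      # start from the intercept-only model
fwd <- step(null_mod,
            scope = ~ x1 + x2 + x3,  # candidate predictors to consider
            direction = "forward",
            trace = 0)               # suppress the step-by-step log
formula(fwd)
```

Each step adds the candidate predictor that lowers AIC the most, stopping when no addition improves it.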
4.3.3.2 Backward Selection
Begin with a saturated model containing all candidate predictors. Remove predictors that do not meaningfully contribute.
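Backward elimination reverses the process, again sketched with step() on simulated data:

```r
# Simulated data: only x1 truly affects y
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100)

full_mod <- lm(y ~ x1 + x2 + x3, data = d)  # start from all candidate predictors
bwd <- step(full_mod, direction = "backward", trace = 0)
formula(bwd)  # predictors whose removal lowers AIC are dropped
```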
Note
4.4 Exercises
Exercise 1
A scatterplot of \(Y\) vs. \(X\) shows curvature.
- Fit a straight-line model and inspect residuals.
- Fit a model including \(X^2\).
- Compare AIC and adjusted \(R^2\) between the models.
- Discuss whether the added complexity is justified.